Exercise - Validation Metrics for Classification

Load the data (train and test data)
Fit the Logistic Regression
(ASSIGNMENT) Check the accuracy and the AU ROC
Visualize the ROC curve
Discuss metric results

NOTE: Run all cells up to Task 1 without making changes.

By: Hugo Lopes
Learning Unit 11



In [ ]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
%matplotlib inline



In [ ]:

    
def plot_roc_curve(roc_auc, fpr, tpr):
    """Plot a ROC curve together with the diagonal of a random classifier.

    Inputs:
        roc_auc - AU ROC value (float)
        fpr - false positive rates (array, output of roc_curve())
        tpr - true positive rates (array, output of roc_curve())
    """
    
    plt.figure(figsize=(8,6))
    lw = 2
    plt.plot(fpr, tpr, color='orange', lw=lw, label='ROC curve (AUROC = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--', label='random')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.grid()
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
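To make the expected inputs of plot_roc_curve() concrete, here is a minimal sketch of how roc_curve() and roc_auc_score() behave on a tiny toy example. The labels and scores below are made up for illustration and are not taken from the exercise dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy data (hypothetical): true binary labels and the score of the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve() returns one (fpr, tpr) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# roc_auc_score() summarizes the whole curve in a single number.
roc_auc = roc_auc_score(y_true, y_score)

print('fpr:', fpr)
print('tpr:', tpr)
print('AU ROC:', roc_auc)
```

These three values (roc_auc, fpr, tpr) have the shapes that plot_roc_curve() takes as arguments.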

Load an example dataset

The data is already prepared for a classifier.



In [ ]:

    
df = pd.read_csv('../data/exercise_dataset_LU11.csv')
print('Shape:', df.shape)
df.head()

Split the data into train and test sets (the first column is the target, the remaining columns are the features):

X_train: train data
y_train: target of train data
X_test: test data
y_test: target of test data



In [ ]:

    
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], 
                                                    df.iloc[:, 0], 
                                                    test_size=0.33, 
                                                    random_state=42)

Task 1: Fit a LogisticRegression() on the train set



In [ ]:

    
# Code here:

Task 2: Get the predictions and the scores (predicted probabilities) on the test set



In [ ]:

    
# Code here:

Task 3: Compute the accuracy score, the AU ROC, and the ROC curve



In [ ]:

    
# Code here for accuracy score, AU ROC:



In [ ]:

    
# Code here for ROC curve:

# Call plot_roc_curve():

Task 4: Discuss the results

What do you think about the AU ROC? And what about the accuracy score: do you consider it high?
Hint: take a look at the class balance.
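As a starting point for the discussion, the sketch below shows why class balance matters. It uses hypothetical labels (95 negatives, 5 positives) and a "model" that always predicts the majority class. The helper functions accuracy() and auroc() are illustrative pure-Python versions written for this example, not the scikit-learn implementations; auroc() uses the pairwise-ranking definition of the AU ROC, with ties counted as 0.5.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, y_score):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5

# A baseline that ignores the features: always class 0, constant score.
y_pred = [0] * 100
y_score = [0.0] * 100

baseline_accuracy = accuracy(y_true, y_pred)
baseline_auroc = auroc(y_true, y_score)

print('Accuracy:', baseline_accuracy)
print('AU ROC:', baseline_auroc)
```

The baseline reaches an accuracy of 0.95 without learning anything, while its AU ROC stays at 0.5, the value of a random classifier. Comparing your model's accuracy against the share of the majority class in y_test is therefore a useful sanity check.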